HHMM-based Chinese Lexical Analyzer ICTCLAS
نویسندگان
چکیده
This document presents the results from Inst. of Computing Tech., CAS in the ACLSIGHAN-sponsored First International Chinese Word Segmentation Bakeoff. The authors introduce the unified HHMM-based frame of our Chinese lexical analyzer ICTCLAS and explain the operation of the six tracks. Then provide the evaluation results and give more analysis. Evaluation on ICTCLAS shows that its performance is competitive. Compared with other system, ICTCLAS has ranked top both in CTB and PK closed track. In PK open track, it ranks second position. ICTCLAS BIG5 version was transformed from GB version only in two days; however, it achieved well in two BIG5 closed tracks. Through the first bakeoff, we could learn more about the development in Chinese word segmentation and become more confident on our HHMM-based approach. At the same time, we really find our problems during the evaluation. The bakeoff is interesting and helpful.
منابع مشابه
Chinese Lexical Analysis Using Hierarchical Hidden Markov Model
This paper presents a unified approach for Chinese lexical analysis using hierarchical hidden Markov model (HHMM), which aims to incorporate Chinese word segmentation, Part-Of-Speech tagging, disambiguation and unknown words recognition into a whole theoretical frame. A class-based HMM is applied in word segmentation, and in this level unknown words are treated in the same way as common words l...
متن کاملThe ICT,CAS MT Systems for the IWSLT09 Evaluation
We only use the data provided by the organizer for each task. We first used the Chinese lexical analysis system ICTCLAS for splitting Chinese characters into words and a rule-based tokenizer for tokenizing English sentences. Then, we convert all alphanumeric characters to their 2byte representation. Finally, we ran GIZA++ and used the “grow-diagfinal” heuristic to get many-to-many word alignmen...
متن کاملChinese Named Entity Recognition Using Role Model
This paper presents a stochastic model to tackle the problem of Chinese named entity recognition. In this research, we unify component tokens of named entity and their contexts into a generalized role set, which is like part-of-speech (POS). The probabilities of role emission and transition are acquired after machine learning on a role-labeled data set, which is transformed from a hand-correcte...
متن کاملChinese Word Segmentation Based On Direct Maximum Entropy Model
Chinese word segmentation is a fundamental and important issue in Chinese information processing. In order to find a unified approach for Chinese word segmentation, the author develop a Chinese lexical analyzer PCWS using direct maximum entropy model. The paper presents the general description of PCWS, as well as the result and analysis of its performance at the Second International Chinese Wor...
متن کاملA Hybrid Approach to Chinese-English Machine Translation
A hybrid method to Chinese-English machine translation is presented, a rule-based analysis is combined with statistical data. The rule-based lexical analyzer and syntactic analyzer leave some amount of ambiguity that are resolved using statistical approach. Hidden Markov Model(HMM) is used to return a score for each parts of speech, improved probabilistic context free grammar(PCFG) is used for ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003